Introduction

About EDA
Exploratory Data Analysis (EDA) is the process of conducting a first pass at a new or poorly understood set of data with the goal of exploring the shape, structure, and contents of the dataset.
EDA can help analysts and data scientists understand the scope of the data and how much work will likely be required to clean and transform the dataset into a usable and useful format. The purpose of EDA is not necessarily to answer questions about the content of the dataset, but to find valuable questions to ask of it. This is accomplished through a number of techniques, ranging from simple summary statistics and data visualization to relatively sophisticated modelling, to understand the nature of patterns and relationships in the data.
EDA can be accomplished via a number of means; however, language-based approaches offer inherent repeatability and reproducibility, along with the option to document the potentially complex processes of data transformation and cleaning in-line with the code. Additionally, work done through a scripting language is readily extensible for further development and can form the framework for rapid prototyping and, eventually, production.


About the R Language
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. - https://www.r-project.org/
The R language was developed primarily for working with data and statistical analysis, and as such is uniquely suited to EDA work. It is an open-source language with a number of packages that extend its functionality, and it has an incredibly collaborative and supportive user base.


Relevant Operator Definitions
<- functions as the assignment operator in R and can be read from right to left as
[This_Variable_Name] <- (assign as) [This_Object]

%>% is the pipe operator in the R Tidyverse and is used to string together operations. It can be read from left to right as
[Do this operation] %>% (and then) [Do this other operation]
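A minimal sketch combining both operators (assuming the tidyverse is loaded to provide %>%; the variable names here are purely illustrative):

```r
library(tidyverse)  # provides the %>% pipe via magrittr

# assignment, read right to left: "episodes_per_season is assigned 13"
episodes_per_season <- 13

# piping, read left to right: "take the vector, and then sum it,
# and then take the square root of the sum"
result <- c(2, 4, 6) %>%
  sum() %>%
  sqrt()
```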



Exploratory Analysis

Load Frequently Used Packages

library(tidyverse)
library(plotly)

The tidyverse is a metapackage: a collection of packages designed to extend the functionality of the R language and to work well together.
The plotly package creates client or server side interactive data visualizations.



Import Data

bob_ross_df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-21/bob_ross.csv')

This is a publicly available dataset featured in the weekly TidyTuesday R community GitHub repository.



What is the shape and structure of the dataset?

Dataset shape

glimpse(bob_ross_df)
## Rows: 403
## Columns: 27
## $ painting_index   <dbl> 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292…
## $ img_src          <chr> "https://www.twoinchbrush.com/images/painting282.png"…
## $ painting_title   <chr> "A Walk in the Woods", "Mt. McKinley", "Ebony Sunset"…
## $ season           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,…
## $ episode          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1, 2, 3, 4…
## $ num_colors       <dbl> 8, 8, 9, 3, 8, 4, 8, 8, 8, 8, 8, 4, 8, 12, 12, 13, 3,…
## $ youtube_src      <chr> "https://www.youtube.com/embed/oh5p5f5_-7A", "https:/…
## $ colors           <chr> "['Alizarin Crimson', 'Bright Red', 'Cadmium Yellow',…
## $ color_hex        <chr> "['#4E1500', '#DB0000', '#FFEC00', '#102E3C', '#021E4…
## $ Black_Gesso      <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,…
## $ Bright_Red       <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Burnt_Umber      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Cadmium_Yellow   <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Dark_Sienna      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Indian_Red       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Indian_Yellow    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Liquid_Black     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Liquid_Clear     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Midnight_Black   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Phthalo_Blue     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Phthalo_Green    <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Prussian_Blue    <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Sap_Green        <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…
## $ Titanium_White   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Van_Dyke_Brown   <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,…
## $ Yellow_Ochre     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
## $ Alizarin_Crimson <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRU…

The dataset consists of 403 rows across 27 columns. Each record appears to be a single episode with descriptive information.



Dataset structure

library(skimr)

skim(bob_ross_df) %>% 
  arrange(desc(logical.mean))
Data summary
Name bob_ross_df
Number of rows 403
Number of columns 27
_______________________
Column type frequency:
character 5
logical 18
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
img_src 0 1 49 51 0 403 0
painting_title 0 1 8 27 0 401 0
youtube_src 0 1 41 42 0 403 0
colors 0 1 18 261 0 176 0
color_hex 0 1 11 165 0 175 0

Variable type: logical

skim_variable n_missing complete_rate mean count
Titanium_White 0 1 0.99 TRU: 400, FAL: 3
Alizarin_Crimson 0 1 0.94 TRU: 380, FAL: 23
Van_Dyke_Brown 0 1 0.92 TRU: 371, FAL: 32
Cadmium_Yellow 0 1 0.86 TRU: 346, FAL: 57
Yellow_Ochre 0 1 0.81 TRU: 327, FAL: 76
Phthalo_Blue 0 1 0.80 TRU: 323, FAL: 80
Bright_Red 0 1 0.80 TRU: 321, FAL: 82
Midnight_Black 0 1 0.79 TRU: 317, FAL: 86
Sap_Green 0 1 0.76 TRU: 306, FAL: 97
Indian_Yellow 0 1 0.72 TRU: 292, FAL: 111
Dark_Sienna 0 1 0.72 TRU: 290, FAL: 113
Prussian_Blue 0 1 0.65 TRU: 263, FAL: 140
Phthalo_Green 0 1 0.29 FAL: 287, TRU: 116
Black_Gesso 0 1 0.25 FAL: 302, TRU: 101
Burnt_Umber 0 1 0.14 FAL: 348, TRU: 55
Liquid_Clear 0 1 0.13 FAL: 352, TRU: 51
Liquid_Black 0 1 0.03 FAL: 389, TRU: 14
Indian_Red 0 1 0.00 FAL: 402, TRU: 1

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
painting_index 0 1 208.71 117.90 4 107.5 210 310.5 411 ▇▇▇▇▇
season 0 1 16.00 8.96 1 8.0 16 24.0 31 ▇▇▇▇▇
episode 0 1 7.00 3.75 1 4.0 7 10.0 13 ▇▅▇▅▇
num_colors 0 1 10.61 2.38 1 9.0 11 12.0 15 ▁▁▃▇▃

Thankfully this dataset has no records with missing information and requires little to no cleaning or modification before analysis can begin. Note that the number of colors used per episode (num_colors, under the numeric variables) appears to have a slightly left-skewed distribution.
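As a quick sanity check on that observation (a sketch using the dataframe loaded above), a mean below the median is consistent with a slight left skew; the skim output already shows mean ≈ 10.61 against a p50 of 11:

```r
# mean below median suggests a left-skewed distribution of num_colors
bob_ross_df %>%
  summarise(mean_colors = mean(num_colors),
            median_colors = median(num_colors))
```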



Further EDA

library(DataExplorer)

bob_ross_df %>%
  group_by(season) %>%
  summarise(across(c(Black_Gesso:Alizarin_Crimson),
                   mean)) %>%
  ungroup() %>% 
  plot_correlation(title = 'Color Use per Season Correlation Plot',
                   theme_config = list(title = element_text(face = 'bold'),
                                       axis.text.x = element_text(angle = 90),
                                       legend.position = 'bottom'))
Fig 1, Color Use per Season Correlation Plot


The DataExplorer package is an incredibly powerful tool for exploring datasets in depth with minimal coding. Here we see one function from the package for creating correlation plots. It looks like there are several possible color and season correlations worth exploring further.



How many episodes did Bob Ross record per season?

episode_count_df <-
  bob_ross_df %>%
  count(season, name = 'episodes_per_season') %>%
  count(`episodes_per_season`, name  = 'number_of_seasons')

episode_count_df
## # A tibble: 1 × 2
##   episodes_per_season number_of_seasons
##                 <int>             <int>
## 1                  13                31

Bob recorded 31 seasons of 13 episodes each!



Did the number of colors used change per season?

library(htmlwidgets)

bob_ross_df %>%
  transmute(
    episode = paste0('S', season, ':Ep', episode),
    srt = row_number(),
    num_colors,
    painting_title,
    img_src
  ) %>%
  plot_ly(
    x = ~ fct_reorder(episode, srt),
    y = ~ num_colors,
    color = I('#1A237E'),
    type = 'scatter',
    mode = 'lines+markers',
    text = ~ paste0(
      'Episode: <b>',
      episode,
      '</b><br>',
      'Count of Colors: <b>',
      num_colors,
      '</b><br>',
      'Title: <b>',
      painting_title,
      '</b><br>',
      '<i>click point to open painting</i>'
    ),
    hoverinfo = 'text',
    customdata = ~ img_src
  ) %>%
  layout(
    title = list(text = '<b>Colors Used per Episode</b>\n<i> hover over points for more info</i>',
                 x = 0.05),
    xaxis = list(title = FALSE),
    yaxis = list(title = '<b>Color Count</b>')
  ) %>%
  onRender(
    "
function(el, x) {
  el.on('plotly_click', function(d) {
    var point = d.points[0];
    var url = point.data.customdata[point.pointIndex];
    window.open(url);
  });
}"
  )

Fig 2, Color Count per Episode Line Chart

We see an initial upswing in the number of colors used per episode that becomes relatively stable in later seasons. By putting additional information in the hover tooltip, we ease the cognitive burden on the audience. Additionally, simple JavaScript functions can extend the visualization's functionality, such as the click-to-open link to the described painting demonstrated here.



Did the ratio of colors change over time?

Density plot of individual color usage rates by season

library(ggridges)

bob_ross_df %>% 
  select(season,
         Black_Gesso:Alizarin_Crimson) %>% 
  pivot_longer(cols = !matches('season')) %>% 
  group_by(season,
           name) %>% 
  summarise(use_rate = mean(value)) %>% 
  ungroup() %>% 
  group_by(name) %>% 
  mutate(name_srt = mean(use_rate)) %>% 
  ungroup() %>% 
  ggplot(aes(x = use_rate,
             y = fct_reorder(name, name_srt))) +
  geom_density_ridges_gradient(aes(fill = after_stat(x)),
                               scale = 1.5,
                               color = '#FFFFFF') +
  scale_fill_gradient(low = '#C5CAE9',
                      high = '#1A237E',
                      name = 'Use Rate') +
  labs(title = 'Color Use Rate per Season Distribution',
       x = 'Use Rate per Season',
       y = NULL)
Fig 3.1, Color Use Distribution Density Ridge Plot


This is a density ridge plot with a gradient fill that shows the distribution of color usage rates per season across all colors. We can see that Titanium White was highly used, Indian Red infrequently used, and Phthalo Blue had a highly variable use per season.


colors_per_season_df <-
  bob_ross_df %>%
  group_by(season) %>%
  summarise(across(c(Black_Gesso:Alizarin_Crimson),
                   ~ mean(.))) %>%
  ungroup() %>%
  pivot_longer(cols = !contains('season')) %>%
  mutate(name = str_replace_all(name, '_', ' '))

colors_df <-
  bob_ross_df %>%
  select(colors,
         color_hex) %>%
  mutate(across(everything(),
                ~ str_extract(., "'.*'[,|\\]]"))) %>%
  mutate(across(everything(),
                ~ str_remove_all(., "\\\\r|\\\\n|'|\\]"))) %>%
  separate_longer_delim(cols = everything(),
                        delim = ',') %>%
  mutate(across(everything(),
                ~ str_squish(.))) %>%
  distinct() %>%
  rename(name = colors)

colors_per_season_df %>%
  left_join(colors_df,
            na_matches = 'never') %>%
  highlight_key(~ name) %>% 
  plot_ly(x = ~ season,
          y = ~ value) %>%
  add_bars(
    color = ~ I(color_hex),
    marker = list(line = list(width = 1,
                              color = 'black')),
    text = ~ paste0(
      'Season: <b>',
      season,
      '</b><br>',
      'Color: <b>',
      name,
      '</b><br>',
      'Use Rate: <b>',
      round(value, 2),
      '</b>'
    ),
    hoverinfo = 'text',
    textposition = 'none'
  ) %>%
  layout(barmode = 'stack',
         title = list(text = '<b>Color Usage Rate per Season</b>\n<i> hover over bars for more info, click bar to highlight</i>',
                      x = 0.05),
         xaxis = list(title = '<b>Season</b>'),
         yaxis = list(title = '<b>Color Usage Rate</b>'))

Fig 3.2, Color Use Rate Stacked Barchart

By using relatively simple regular expressions (regex), we can extract each color and its hexadecimal code and map them to their respective bars in this visualization.
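As a simplified sketch of that idea on a single hypothetical string (the full pipeline above additionally strips \r and \n escapes and pairs each color name with its hex code):

```r
library(tidyverse)  # str_remove_all(), str_split_1(), str_squish() from stringr

# one hypothetical value in the same style as the colors column
raw_colors <- "['Alizarin Crimson', 'Bright Red']"

color_names <- raw_colors %>%
  str_remove_all("\\[|\\]|'") %>%  # drop the brackets and single quotes
  str_split_1(',') %>%             # split the list on commas
  str_squish()                     # trim leftover whitespace

color_names
## [1] "Alizarin Crimson" "Bright Red"
```

Note that str_split_1() requires stringr 1.5.0 or later; separate_longer_delim() in the pipeline above achieves the same split directly on dataframe columns.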



What were common title topics?

library(tidytext)

title_tokens_df <-
  bob_ross_df %>%
  select(season,
         painting_title) %>%
  unnest_tokens(input = painting_title,
                output = 'word',
                drop = FALSE)

title_tokens_df %>%
  anti_join(stop_words) %>%
  mutate(word = str_remove(word, 's$')) %>%
  count(word) %>%
  slice_max(order_by = n,
            n = 25) %>%
  plot_ly(x = ~ n,
          y = ~ fct_reorder(word, n)) %>%
  add_bars(
    color = I('#1A237E'),
    text = ~ paste0('Word: <b>', word, '</b><br>',
                    'Count: <b>', n, '</b>'),
    hoverinfo = 'text',
    textposition = 'none'
  ) %>%
  layout(
    title = list(text = '<b>Frequently Used Words in Painting Title</b>',
                 x = 0.05),
    xaxis = list(title = '<b>Word Count</b>'),
    yaxis = list(
      title = FALSE,
      tickprefix = '<b>',
      ticksuffix = '</b> '
    )
  )

Fig 4, Top Painting Title Term Frequency Barchart

The tidytext package greatly simplifies the steps required for text mining. Here we can easily unnest the painting titles into columns of their discrete words for further analysis.



Did topics change by season?

library(tidylo)

title_lo_top_df <-
  title_tokens_df %>%
  mutate(word = str_remove(word, 's$')) %>%
  count(season,
        word) %>%
  bind_log_odds(set = season,
                feature = word,
                n = n) %>%
  anti_join(stop_words) %>%
  slice_max(order_by = log_odds_weighted,
            n = 50)

title_lo_top_df %>%
  ggplot(aes(x = log_odds_weighted,
             y = fct_reorder(word, log_odds_weighted))) +
  geom_col(fill = '#1A237E') +
  facet_wrap(vars(season),
             scales = 'free') +
  theme(axis.title.y = element_blank())
Fig 5.1, Top Topic Weighted Log Odds by Season Trellis Chart


The tidylo package is small but powerful in that it enables the calculation of weighted log odds with uninformative Dirichlet priors. This improves on the tidytext package’s Term Frequency vs Inverse Document Frequency (TF-IDF) when dealing with small term frequencies. It is worth noting that recent research indicates that including stop words in the initial calculation or in further modelling efforts generates more accurate results; however, it is still good practice to remove them when communicating results to lay audiences to prevent misinterpretation.


title_lo_top_df %>%
  group_by(word) %>%
  mutate(word_srt = mean(log_odds_weighted)) %>%
  ungroup() %>%
  plot_ly(x = ~ season,
          y = ~ fct_reorder(word, word_srt)) %>%
  add_heatmap(
    z = ~ log_odds_weighted,
    xgap = 1,
    ygap = 1,
    text = ~ paste0(
      'Season: <b>',
      season,
      '</b><br>',
      'Title Word: <b>',
      word,
      '</b><br>',
      'Wt. Log Odds: <b>',
      round(log_odds_weighted, 2),
      '</b>'
    ),
    hoverinfo = 'text'
  ) %>%
  layout(title = list(text = '<b>Topic Heatmap</b>',
                      x = 0.05),
         yaxis = list(title = FALSE),
         xaxis = list(title = '<b>Season</b>'))

Fig 5.2, Top Topic Weighted Log Odds Heatmap

This heatmap is an alternate visual for showing important topics by season. The advantage here is the ability to follow a topic or season across an axis; the trade-off is a slight decrease in readability.



Predictive Model

Can we predict if a painting is a Winter scene from the range of colors used?

First we prepare a dataframe for modelling

library(tidymodels)
library(themis)

# prepare dataframe for modelling 
model_data_df <-
  bob_ross_df %>%
  select(painting_title,
         Black_Gesso:Alizarin_Crimson) %>%
  mutate(has_winter = if_else(str_detect(str_to_lower(painting_title),
                                           'winter'),
                                '1_winter',
                                '2_no_winter')) %>%
  select(-painting_title) %>%
  mutate(across(c(Black_Gesso:Alizarin_Crimson),
                ~ if_else(., 1, 0)))

glimpse(model_data_df)
## Rows: 403
## Columns: 19
## $ Black_Gesso      <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ Bright_Red       <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Burnt_Umber      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,…
## $ Cadmium_Yellow   <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Dark_Sienna      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Indian_Red       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Indian_Yellow    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ Liquid_Black     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Liquid_Clear     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Midnight_Black   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Phthalo_Blue     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,…
## $ Phthalo_Green    <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Prussian_Blue    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Sap_Green        <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ Titanium_White   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Van_Dyke_Brown   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ Yellow_Ochre     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1,…
## $ Alizarin_Crimson <dbl> 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ has_winter       <chr> "2_no_winter", "2_no_winter", "2_no_winter", "1_winte…

Creating a separate dataframe of only the outcome we want to predict and the features we would like to use for prediction eases additional downstream transformations. Depending on the final modelling algorithm used, this can be an iterative step.



Next we split the data into training and testing sets

# create training and testing data splits

set.seed(42)

raw_split <- model_data_df %>%
  initial_split(strata = has_winter)

raw_train <- training(raw_split)

raw_test <- testing(raw_split)

raw_split
## <Training/Testing/Total>
## <301/102/403>

Here we see the ~75/25 training/testing split of our data (301/102 records, the initial_split() default). We want to train and assess the potential model while keeping the testing portion as a holdout to be used only once for final model evaluation.



Now we create resampling folds for training and cross-validation

# create cross-validation folds for training and assessment

set.seed(123)

raw_folds <- vfold_cv(raw_train,
                      strata = has_winter)

raw_folds
## #  10-fold cross-validation using stratification 
## # A tibble: 10 × 2
##    splits           id    
##    <list>           <chr> 
##  1 <split [270/31]> Fold01
##  2 <split [271/30]> Fold02
##  3 <split [271/30]> Fold03
##  4 <split [271/30]> Fold04
##  5 <split [271/30]> Fold05
##  6 <split [271/30]> Fold06
##  7 <split [271/30]> Fold07
##  8 <split [271/30]> Fold08
##  9 <split [271/30]> Fold09
## 10 <split [271/30]> Fold10

Because this dataset is so small, we resample the training data rather than splitting it again. vfold_cv() produces ten stratified folds that will be used to assess and tune the model during training through cross-validation.



Then we create a model workflow comprised of a step recipe and model specs

# create model step recipe
raw_rec <-
  recipe(has_winter ~ .,
         data = raw_train) %>%
  step_upsample(has_winter) %>%
  step_dummy(all_nominal_predictors())

# set model algorithm and specs
xgb_spec <-
  boost_tree(
    trees = tune(),
    mtry = tune(),
    min_n = tune(),
    learn_rate = 0.1
  ) %>%
  set_engine('xgboost') %>%
  set_mode('classification')

# create model workflow
xgb_wf <- workflow(raw_rec, xgb_spec)

xgb_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: boost_tree()
## 
## ── Preprocessor ────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_upsample()
## • step_dummy()
## 
## ── Model ───────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = tune()
##   min_n = tune()
##   learn_rate = 0.1
## 
## Computational engine: xgboost

The tidymodels metapackage makes it easy to string together the common modelling steps into recipes and workflows that are easy for humans to read and comprehend. Here we have a dataset well suited to decision tree models and will be using boosted decision trees with the popular XGBoost algorithm. This algorithm has a large number of hyperparameters we could tune to change model performance, however here we will use some of the built-in package functionality to tune the major hyperparameters for us on the already built training folds.



Now we tune the model hyperparameters

Here we can see the results of the model tuning

library(finetune)

# turn on parallel processing
doParallel::registerDoParallel()

# Tune the model hyperparameters
set.seed(628)

xgb_raw_rs <-
  tune_race_anova(
    xgb_wf,
    raw_folds,
    grid = 20,
    control = control_race(verbose_elim = TRUE)
  )

plot_race(xgb_raw_rs)
Fig 6.1, ANOVA Race Tuning


show_best(xgb_raw_rs)
## # A tibble: 5 × 9
##    mtry trees min_n .metric .estimator  mean     n std_err .config              
##   <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
## 1     9   206    19 roc_auc binary     0.878    10  0.0308 Preprocessor1_Model10
## 2     1   465    20 roc_auc binary     0.877    10  0.0310 Preprocessor1_Model01
## 3    14   877    31 roc_auc binary     0.875    10  0.0304 Preprocessor1_Model16
## 4     7   104    16 roc_auc binary     0.874    10  0.0308 Preprocessor1_Model08
## 5     5   379    37 roc_auc binary     0.874    10  0.0334 Preprocessor1_Model05

The ANOVA racing approach is efficient: as hyperparameter combinations are shown to be less performant, they are dropped from further resampling, reducing total time and resources. Here we can see the separate trials and the best-performing variants.



Finally we fit the model to our testing holdout

xgb_last <-
  xgb_wf %>%
  finalize_workflow(select_best(xgb_raw_rs, metric = 'roc_auc')) %>%
  last_fit(raw_split)

xgb_fit <- extract_fit_parsnip(xgb_last)

final_preds <-
  xgb_last %>%
  collect_predictions()

final_preds %>%
  roc_curve(has_winter, .pred_1_winter) %>%
  plot_ly(
    x = ~ 1 - specificity,
    y = ~ sensitivity,
    showlegend = FALSE
  ) %>%
  add_lines(
    color = I('#1A237E'),
    line = list(width = 4),
    text = ~ paste0(
      'Specificity: <b>',
      round(specificity, 3),
      '</b><br>',
      'Sensitivity: <b>',
      round(sensitivity, 3),
      '</b>'
    ),
    hoverinfo = 'text'
  ) %>%
  add_segments(
    x = 0,
    xend = 1,
    y = 0,
    yend = 1,
    line = list(dash = 'dash'),
    color = I('#BDBDBD')
  ) %>%
  layout(xaxis = list(scaleanchor = 'y',
                      scaleratio = 1),
         title = list(text = '<b>ROC curve</b>',
                      x = 0.05))

Fig 6.2, Final Model Fit AUC Chart

collect_metrics(xgb_last)
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.833 Preprocessor1_Model1
## 2 roc_auc  binary         0.939 Preprocessor1_Model1
final_preds %>%
  conf_mat(has_winter, .pred_class)
##              Truth
## Prediction    1_winter 2_no_winter
##   1_winter          11          17
##   2_no_winter        0          74

We can see that our model seems to perform well; however, we would potentially need to modify it further depending on how the model will be deployed. The decision threshold could be customized to decrease the number of false positives at the cost of increasing false negatives, should greater certainty be needed when predicting a Winter painting.
Again, these changes should be made using the training data during the assessment stage, prior to this final fit, but are not shown here for the sake of demonstration length.
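As a sketch of that idea (assuming the final_preds tibble from the fit above; the 0.9 cutoff is purely illustrative, not a tuned value), a stricter probability threshold for the Winter class could be applied before building the confusion matrix:

```r
# require a higher predicted probability before calling a painting Winter;
# this trades false positives for potential false negatives
strict_threshold <- 0.9  # illustrative only

final_preds %>%
  mutate(.pred_strict = factor(
    if_else(.pred_1_winter >= strict_threshold,
            '1_winter',
            '2_no_winter'),
    levels = levels(.pred_class)
  )) %>%
  conf_mat(has_winter, .pred_strict)
```

The probably package also offers helpers such as threshold_perf() for evaluating a grid of candidate thresholds against training resamples.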



Lastly we explain the model

library(SHAPforxgboost)

FTR_Shap <-
  shap.prep(
    xgb_model = extract_fit_engine(xgb_fit),
    X_train = bake(
      prep(raw_rec),
      has_role('predictor'),
      new_data = NULL,
      composition = 'matrix'
    )
  )

shap.plot.summary(FTR_Shap) +
  labs(title = 'Color Use Impact on Likelihood of Winter Painting') +
  theme(title = element_text(face = 'bold'),
        axis.text.y = element_text(face = 'bold'))
Fig 6.3, Final Model Fit SHAP Beeswarm Summary Plot


Decision trees are highly useful algorithms when model explainability is a must. SHAP values have recently been making a comeback as a way to explain not only variable importance and impact on model outputs, but also variable interactions. The SHAPforxgboost package gives access to a number of data visualizations for SHAP values.



Summary

EDA Results
Over the course of 31 seasons of 13 episodes each, Bob Ross painted a total of 403 paintings!
His subjects ranged from forest cabins to winter landscape scenes, and he was particularly fond of painting mountains. The topics he chose to paint were broad, but occasionally followed predictable patterns, with several topics appearing across multiple seasons. Season six, however, seemed to be a turning point, as the number of topics he painted varied wildly.
Bob’s choice of colors ramped up over the first few seasons but then remained relatively steady through the full production run. His use of color seems to be associated with his choice of topic, with an initial model predicting Winter paintings showing promising early results.


Recommendations
Further analysis is warranted regarding the predictability of all common topics given the color usage. Decision thresholds for predictions may need to be customized, as this is a small dataset with wide-ranging topics.
A literature review could also be helpful in identifying additional data regarding his changing choice of topics and slightly evolving use of colors and styles.



End of Document